A Geometric Approach to Sound Source Localization from Time-Delay Estimates
This paper addresses the problem of sound-source localization from time-delay
estimates using arbitrarily-shaped non-coplanar microphone arrays. A novel
geometric formulation is proposed, together with a thorough algebraic analysis
and a global optimization solver. The proposed model is thoroughly described
and evaluated. The geometric analysis, stemming from the direct acoustic
propagation model, leads to necessary and sufficient conditions for a set of
time delays to correspond to a unique position in the source space. Such sets
of time delays are referred to as feasible sets. We formally prove that every
feasible set corresponds to exactly one position in the source space, whose
value can be recovered using a closed-form localization mapping. We therefore
seek the optimal feasible set of time delays given, as input, the received
microphone signals. This time-delay estimation problem is naturally cast as a
programming task, constrained by the feasibility conditions derived from the
geometric analysis. A global branch-and-bound optimization technique is
proposed to solve the problem at hand, hence estimating the best set of
feasible time delays and, subsequently, localizing the sound source. Extensive
experiments with both simulated and real data are reported; we compare our
methodology to four state-of-the-art techniques. This comparison clearly shows
that the proposed method combined with the branch-and-bound algorithm
outperforms existing methods. This in-depth geometric understanding, together
with practical algorithms and encouraging results, opens several opportunities
for future work.
Comment: 13 pages, 2 figures, 3 tables, journal
Switching Variational Auto-Encoders for Noise-Agnostic Audio-visual Speech Enhancement
Recently, audio-visual speech enhancement has been tackled in an
unsupervised setting based on variational auto-encoders (VAEs), where during
training only clean data is used to train a generative model for speech, which
at test time is combined with a noise model, e.g. nonnegative matrix
factorization (NMF), whose parameters are learned without supervision.
Consequently, the proposed model is agnostic to the noise type. When visual
data are clean, audio-visual VAE-based architectures usually outperform the
audio-only counterpart. The opposite happens when the visual data are corrupted
by clutter, e.g. the speaker not facing the camera. In this paper, we propose
to find the optimal combination of these two architectures through time. More
precisely, we introduce the use of a latent sequential variable with Markovian
dependencies to switch between different VAE architectures through time in an
unsupervised manner, leading to the switching variational auto-encoder (SwVAE). We
propose a variational factorization to approximate the computationally
intractable posterior distribution. We also derive the corresponding
variational expectation-maximization algorithm to estimate the parameters of
the model and enhance the speech signal. Our experiments demonstrate the
promising performance of SwVAE.
Comment: 2021 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP)
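The latent sequential variable with Markovian dependencies can be illustrated with a standard forward (filtering) recursion over K discrete states. The per-frame branch log-likelihoods and the transition matrix below are toy stand-ins, not the paper's VAE posteriors.

```python
import numpy as np

def forward_filter(loglik, trans, prior):
    """Filtering posterior p(z_t | x_{1:t}) for a discrete Markov switching
    variable; loglik[t, k] is the frame-t log-likelihood under branch k."""
    T, K = loglik.shape
    alpha = np.zeros((T, K))
    a = prior * np.exp(loglik[0] - loglik[0].max())
    alpha[0] = a / a.sum()
    for t in range(1, T):
        a = (alpha[t - 1] @ trans) * np.exp(loglik[t] - loglik[t].max())
        alpha[t] = a / a.sum()
    return alpha

# toy: branch 0 explains the first half of the frames, branch 1 the second half
T = 20
loglik = np.where(np.arange(T)[:, None] < T // 2, [0.0, -3.0], [-3.0, 0.0])
trans = np.array([[0.9, 0.1], [0.1, 0.9]])   # sticky switching prior
post = forward_filter(loglik, trans, prior=np.array([0.5, 0.5]))
print(post[0].argmax(), post[-1].argmax())
```

The sticky transition matrix favors staying in the current branch, so the posterior switches cleanly once the evidence changes, which is the behavior one wants when alternating between audio-only and audio-visual enhancement.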
EM Algorithms for Weighted-Data Clustering with Application to Audio-Visual Scene Analysis
Data clustering has received a lot of attention and numerous methods,
algorithms and software packages are available. Among these techniques,
parametric finite-mixture models play a central role due to their interesting
mathematical properties and to the existence of maximum-likelihood estimators
based on expectation-maximization (EM). In this paper we propose a new mixture
model that associates a weight with each observed point. We introduce the
weighted-data Gaussian mixture and we derive two EM algorithms. The first one
considers a fixed weight for each observation. The second one treats each
weight as a random variable following a gamma distribution. We propose a model
selection method based on a minimum message length criterion, provide a weight
initialization strategy, and validate the proposed algorithms by comparing them
with several state-of-the-art parametric and non-parametric clustering
techniques. We also demonstrate the effectiveness and robustness of the
proposed clustering technique in the presence of heterogeneous data, namely in
audio-visual scene analysis.
Comment: 14 pages, 4 figures, 4 tables
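A minimal sketch of the fixed-weight variant (the paper's first algorithm) is shown below, restricted to isotropic covariances for brevity; the gamma-weight extension, full covariances, and the paper's weight initialization strategy are not reproduced.

```python
import numpy as np

def weighted_gmm_em(X, w, mu0, iters=50):
    """EM for an isotropic weighted-data Gaussian mixture with fixed per-point
    weights w_n (a simplified sketch, not the paper's full model)."""
    N, D = X.shape
    mu = np.array(mu0, dtype=float)
    K = len(mu)
    var = np.full(K, X.var())
    pi = np.full(K, 1.0 / K)
    for _ in range(iters):
        # E-step: responsibilities r_nk under the current parameters
        logp = (-0.5 * ((X[:, None] - mu[None]) ** 2).sum(-1) / var
                - 0.5 * D * np.log(2 * np.pi * var) + np.log(pi))
        r = np.exp(logp - logp.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: point n contributes w_n * r_nk to the sufficient statistics
        wr = w[:, None] * r
        Nk = wr.sum(axis=0)
        pi = Nk / Nk.sum()
        mu = (wr.T @ X) / Nk[:, None]
        var = (wr * ((X[:, None] - mu[None]) ** 2).sum(-1)).sum(axis=0) / (D * Nk)
    return pi, mu, var

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.2, (50, 2)),
               rng.normal(3.0, 0.2, (50, 2)),
               [[10.0, 10.0]]])               # one gross outlier
w = np.ones(len(X))
w[-1] = 1e-3                                  # heavily down-weight the outlier
pi, mu, var = weighted_gmm_em(X, w, mu0=X[[0, 50]])
print(np.round(np.sort(mu[:, 0]), 1))
```

Down-weighting the outlier leaves the estimated means at the two true cluster centers, illustrating the robustness to heterogeneous data that the abstract claims.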
Online Localization and Tracking of Multiple Moving Speakers in Reverberant Environments
We address the problem of online localization and tracking of multiple moving
speakers in reverberant environments. The paper has the following
contributions. We use the direct-path relative transfer function (DP-RTF), an
inter-channel feature that encodes acoustic information robust against
reverberation, and we propose an online algorithm well suited for estimating
DP-RTFs associated with moving audio sources. Another crucial ingredient of the
proposed method is its ability to properly assign DP-RTFs to audio-source
directions. Towards this goal, we adopt a maximum-likelihood formulation and
propose an exponentiated-gradient (EG) method to efficiently update
source-direction estimates starting from their currently available values. The
problem of multiple speaker tracking is computationally intractable because the
number of possible associations between observed source directions and physical
speakers grows exponentially with time. We adopt a Bayesian framework and we
propose a variational approximation of the posterior filtering distribution
associated with multiple speaker tracking, as well as an efficient variational
expectation-maximization (VEM) solver. The proposed online localization and
tracking method is thoroughly evaluated using two datasets that contain
recordings performed in real environments.
Comment: IEEE Journal of Selected Topics in Signal Processing, 201
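The exponentiated-gradient step itself is compact: a multiplicative update that keeps the source-direction weights on the probability simplex. The candidate-direction likelihood table below is a toy stand-in for the DP-RTF observation model.

```python
import numpy as np

def eg_update(alpha, grad, eta=0.5):
    """One exponentiated-gradient step: multiplicative, so the direction
    weights remain nonnegative and sum to one."""
    a = alpha * np.exp(-eta * grad)
    return a / a.sum()

# toy table p(feature | direction) for 4 candidate directions; the true
# source sits at direction index 2 (stand-in for DP-RTF likelihoods)
like = np.array([[0.05, 0.10, 0.80, 0.05],
                 [0.10, 0.05, 0.70, 0.15],
                 [0.02, 0.08, 0.85, 0.05]])
alpha = np.full(4, 0.25)                      # uniform initial weights
for _ in range(100):
    p = like @ alpha                          # per-feature mixture likelihood
    grad = -(like / p[:, None]).sum(axis=0)   # gradient of the negative log-lik
    alpha = eg_update(alpha, grad)
print(np.round(alpha, 2))
```

The weights concentrate on the direction that best explains the observed features, which is the maximum-likelihood assignment in this toy case.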
Cross-Paced Representation Learning with Partial Curricula for Sketch-based Image Retrieval
In this paper we address the problem of learning robust cross-domain
representations for sketch-based image retrieval (SBIR). While most SBIR
approaches focus on extracting low- and mid-level descriptors for direct
feature matching, recent works have shown the benefit of learning coupled
feature representations to describe data from two related sources. However,
cross-domain representation learning methods are typically cast into non-convex
minimization problems that are difficult to optimize, leading to unsatisfactory
performance. Inspired by self-paced learning, a learning methodology designed
to overcome convergence issues related to local optima by exploiting the
samples in a meaningful order (i.e. easy to hard), we introduce the cross-paced
partial curriculum learning (CPPCL) framework. Compared with existing
self-paced learning methods which only consider a single modality and cannot
deal with prior knowledge, CPPCL is specifically designed to assess the
learning pace by jointly handling data from dual sources and modality-specific
prior information provided in the form of partial curricula. Additionally,
thanks to the learned dictionaries, we demonstrate that the proposed CPPCL
embeds robust coupled representations for SBIR. Our approach is extensively
evaluated on four publicly available datasets (i.e. CUFS, Flickr15K, QueenMary
SBIR and TU-Berlin Extension datasets), showing superior performance over
competing SBIR methods.
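The self-paced idea referenced above (admit easy samples first, then gradually harden the curriculum) can be sketched for plain least squares; the pace schedule and the regression task are illustrative stand-ins, not the CPPCL objective.

```python
import numpy as np

def self_paced_fit(X, y, lam0=1.0, growth=1.5, rounds=5):
    """Self-paced least squares: keep samples whose squared residual is below
    the pace parameter lambda, refit, then grow lambda (easy-to-hard order)."""
    w = np.linalg.lstsq(X, y, rcond=None)[0]   # warm start on all samples
    lam = lam0
    for _ in range(rounds):
        loss = (X @ w - y) ** 2
        v = loss < lam                          # binary easy-sample indicator
        if v.any():
            w = np.linalg.lstsq(X[v], y[v], rcond=None)[0]
        lam *= growth                           # relax the curriculum
    return w

rng = np.random.default_rng(0)
X = np.c_[np.ones(60), rng.uniform(-1, 1, 60)]
y = X @ np.array([1.0, 2.0]) + rng.normal(0.0, 0.05, 60)
y[:5] += 8.0                                    # a few corrupting outliers
w = self_paced_fit(X, y)
print(np.round(w, 1))
```

Because the corrupted samples never fall under the pace threshold, the fit converges to the clean-data solution, illustrating how the easy-to-hard ordering sidesteps bad local solutions.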
SocialInteractionGAN: Multi-person Interaction Sequence Generation
Prediction of human actions in social interactions has important applications
in the design of social robots or artificial avatars. In this paper, we model
human interaction generation as a discrete multi-sequence generation problem
and present SocialInteractionGAN, a novel adversarial architecture for
conditional interaction generation. Our model builds on a recurrent
encoder-decoder generator network and a dual-stream discriminator. This
architecture allows the discriminator to jointly assess the realism of
interactions and that of individual action sequences. Within each stream a
recurrent network operating on short subsequences endows the output signal with
local assessments, better guiding the forthcoming generation. Crucially,
contextual information on interacting participants is shared among agents and
reinjected in both the generation and the discriminator evaluation processes.
We show that the proposed SocialInteractionGAN succeeds in producing highly
realistic action sequences of interacting people, comparing favorably to a
diversity of recurrent and convolutional discriminator baselines. Evaluations
are conducted using modified Inception Score and Fréchet Inception Distance
metrics that we specifically design for discrete sequential generated data.
The distribution of generated sequences is shown to closely approach that of
real data. In particular, our model properly learns the dynamics of interaction
sequences while exploiting the full range of actions.
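For reference, the standard Fréchet distance that the modified metric builds on has a closed form between Gaussian fits of two feature sets; the random features below stand in for sequence embeddings.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two feature sets:
    ||mu_a - mu_b||^2 + Tr(S_a + S_b - 2 (S_a S_b)^{1/2})."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    sa = np.cov(feats_a, rowvar=False)
    sb = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(sa @ sb)
    if np.iscomplexobj(covmean):
        covmean = covmean.real                 # discard numerical imaginary part
    return float(((mu_a - mu_b) ** 2).sum() + np.trace(sa + sb - 2.0 * covmean))

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, (500, 8))          # stand-in for real-data features
close = rng.normal(0.1, 1.0, (500, 8))         # slightly shifted distribution
far = rng.normal(2.0, 1.0, (500, 8))           # clearly different distribution
print(frechet_distance(real, close) < frechet_distance(real, far))
```

The distance grows with the mismatch between the two fitted Gaussians, which is why it serves as a proxy for how closely generated sequences approach the real-data distribution.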